Objective¶

Traditional feature selection is conducted base on feature importance. With the development of machine learning interpretion, multiple methods are generated to interpret the 'black box' of machine learning. One of the most widely used is SHAP. When I worked on this project, I was thinking maybe SHAP can also used on feature selection to improve the model.

Some researches are supporting my thought. Like https://iopscience.iop.org/article/10.1088/1742-6596/1284/1/012026/pdf. In this paper, the author shows that new feature selection method base on the SHAP values is superior to widely used prediction methods.

This notebook aims to test wether SHAP value used in feature selection is better for this sepcific project.

Project Objective¶

The objective of this project is to develop a customer resiliency score to help rank customers' resilience to some unexpected disaster or an economic downturn. This scoring model will be used on all cunsumer porfolios. It will predict the likelihood of the customer having a FICO score drop of 40 points or worse in the next six months.

import pandas as pd
from importlib import reload
import numpy as np

pd.set_option('display.max_columns', None)  # or 1000
pd.set_option('display.max_rows', None)  # or 1000
pd.set_option('display.max_colwidth', -1)  # or 
pd.set_option('float_format', '{:f}'.format)
from IPython.display import display, HTML
import warnings
warnings.filterwarnings("ignore")

Input Variables¶

Deposit Data (savings, DDA, other relationship data), 440 variables
Credit Attributes, 887 variables
Demographic data ,103 variables
Income data (internal income, stated income, mortgage income, income repository), 13 variables

df1=pd.read_parquet('../../Data/df_combined.parquet')
dfOOt=pd.read_parquet('../../Data/df_combined_oot_1snap.parquet')

df1=df1.loc[(df1['target']>=0)]

catCols=['in_cma_cus_state','in_cma_score_type_tru','in_cma_score_type','in_cma_fraud_alert_ind','in_cma_addr_disc_ind'
         ,'in_cma_vantage_rsn_5','in_cma_bni_rsn_5','in_cma_system','Income_Confidence'
         ,'in_cma_reject_ind','Income_Confidence1','Income_Confidence2'
         ,'Income_Confidence3']

excludeCols=['in_cma_account','in_cma_score_date','in_cma_eq_seqnum','income_date_x','income_date_y'
             ,'in_cma_cus_ssn','in_cma_cus_zipcode'
             ,'in_cma_score_date','in_cma_cus_name_1','in_cma_cus_name_2'
             ,'in_cma_project_id','in_cma_cus_addr_1','in_cma_cus_addr_2'
             ,'in_cma_cus_city','dob','BATCH','in_tsys_fill','in_acls_bank','in_acls_loan'
             ,'in_acls_appl','in_acls_fill','in_tsys_acct','in_tsys_fill','in_cma_cus_account_num'
             ,'in_cma_orig_acct_num','reported_inc_hit','ods_business_dt','diff_score','lag_score'] #dfBase_2lag

dfTotal=df1.drop(excludeCols,1)
print(dfTotal.shape)
dfTotal=dfTotal.drop_duplicates(keep="first")
print(dfTotal.shape)

(283229, 1399)
(283229, 1399)

# One-hot encode is used on x
dfTotal = pd.get_dummies(dfTotal, columns=catCols)

dfTotal = dfTotal.astype(float)

# xgboost can handle nan value property. But in general, we need to handle nan for other classification models.
dfTotal=dfTotal.fillna(-9999)

dfTotal.target.value_counts()

0.000000    247736
1.000000    35493 
Name: target, dtype: int64

# Remove Collinear Variables
threshold = 0.8
corr_matrix = dfTotal.corr().abs()
print(dfTotal.shape)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool))
to_drop = [column for column in upper.columns if any(upper[column] > threshold)]
dfTotal = dfTotal.drop(columns = to_drop)
print(dfTotal.shape)

(283229, 1490)
(283229, 548)

from sklearn.model_selection import train_test_split
x, y=dfTotal.drop('target',1), dfTotal['target']
train, test = train_test_split(dfTotal, test_size=0.25)

x_train=train.drop('target',1)
y_train=train['target']
x_test=test.drop('target',1)
y_test=test['target']

from sklearn.metrics import roc_curve
def ks_cal(y_test,y_pred):
    fpr,tpr,thresholds=roc_curve(y_test,y_pred)
    ks=max(tpr-fpr)
    return ks

In this project, xgboost is the best model. Here, we use xgboost with default parameters to test the feature selection methodologies.

from xgboost import XGBClassifier
model = XGBClassifier()
model.fit(x_train, y_train)
print('importance type: '+str(imtype)+' , train roc_auc_score: '+ str(roc_auc_score(y_train, model.predict_proba(x_train).T[1])))
print('importance type: '+str(imtype)+' , test roc_auc_score: '+ str(roc_auc_score(y_test, model.predict_proba(x_test).T[1])))

importance type: total_cover , train roc_auc_score: 0.8587836168577115
importance type: total_cover , test roc_auc_score: 0.7217378594650992

We set a base line by using all variables without any feature selection conducted, the test ROC_AUC_score is 0.72.

for imtype in ['gain','weight','cover','total_gain','total_cover'] :
    model = XGBClassifier()
    model.fit(x_train, y_train)
    feature_important = model.get_booster().get_score(importance_type=imtype)
    keys = list(feature_important.keys())
    values = list(feature_important.values())
    dfImportances = pd.DataFrame(data=values, index=keys, columns=["importance"]).sort_values(by = "importance", ascending=False)
    # data.plot(kind='barh')
    dfImportances.head()

    # select top by the gain
    top=30
    x_train2= x_train[dfImportances.head(top).index.tolist()]
    x_test2= x_test[dfImportances.head(top).index.tolist()]
    model.fit(x_train2, y_train)
    print('importance type: '+str(imtype)+' , train roc_auc_score: '+ str(roc_auc_score(y_train, model.predict_proba(x_train2).T[1])))
    print('importance type: '+str(imtype)+' , test roc_auc_score: '+ str(roc_auc_score(y_test, model.predict_proba(x_test2).T[1])))

importance type: gain , train roc_auc_score: 0.7784419892552842
importance type: gain , test roc_auc_score: 0.6963464927485127
importance type: weight , train roc_auc_score: 0.8168131180579382
importance type: weight , test roc_auc_score: 0.6943462903759027
importance type: cover , train roc_auc_score: 0.6515482783836781
importance type: cover , test roc_auc_score: 0.6036780697232002
importance type: total_gain , train roc_auc_score: 0.8068128919942879
importance type: total_gain , test roc_auc_score: 0.7017535687592438
importance type: total_cover , train roc_auc_score: 0.8077814989306732
importance type: total_cover , test roc_auc_score: 0.7012402563905462

According to the xgboost document (https://xgboost.readthedocs.io/en/latest/python/python_api.html), it uses 5 type feature importance, gain, weight, cover, total_gain and toal_cover. You can find the detail infroamtion from the document. Here, we loop through each type. The result shows, for our data, total_gain and total_cover perform better that the others. The total_gain achieves the highest ROC_AUC_socre of 0.702.

# select top by shap value
import shap
from statsmodels.api import add_constant
model.fit(x_train, y_train)
explainer=shap.TreeExplainer(model)
shap_values=explainer.shap_values(x_train)
features_order=pd.DataFrame(x_train.columns, columns=['Features'])
features_order['Shap Feature Importance']=np.sum(np.abs(shap_values),axis=0)
features_order=features_order.sort_values('Shap Feature Importance', ascending=False).reset_index(drop=True)
x_train_left=add_constant(x_train[features_order.iloc[:300]['Features'].tolist()])
features_order1=pd.DataFrame(x_train_left.columns,columns=['Features'])
features_order1=features_order1.merge(features_order,on='Features',how='left')
# features_order1["VIF Factor"]=[variance_inflation_factor(x_train_left.values, i) for i in tqdm(range(x_train_left.shape[1]))]
features_order1=features_order1[~features_order1['Features'].isin(['const'])]
# features_order1=features_order1[features_order1["VIF Factor"] < 5].reset_index(drop=True)

# select top by the shap value
x_train3= x_train[features_order1.head(30)['Features'].tolist()]
x_test3= x_test[features_order1.head(30)['Features'].tolist()]
model.fit(x_train3, y_train)
print(roc_auc_score(y_train, model.predict_proba(x_train3).T[1]))
print(roc_auc_score(y_test, model.predict_proba(x_test3).T[1]))

0.8136590149724379
0.7004414202353029

SHAP value is a novel ensemble learning measurement that is the unique consistent and locally accurate attribute value. It based on ideas from game theory and local explanations. You can find the detail documentation https://github.com/slundberg/shap#citations.

For our data set, by using shape value, and limit the same amount of variables, we get almost the same classification result. The SHAP value does not improve the model much. But it is always good to try.